Inside-Outside Reestimation from Partially Bracketed Corpora
Authors
Abstract
1. MOTIVATION
Grammar inference is a challenging problem for statistical approaches to natural-language processing. The most successful grammar inference techniques involve stochastic finite-state language models such as hidden Markov models (HMMs) [1]. However, finite-state language models fail to represent the hierarchical structure of natural language. Stochastic versions of structurally more expressive grammar formalisms are therefore worth investigating. Baker [2] generalized the parameter estimation methods for HMMs to stochastic context-free grammars (SCFGs) [3] as the inside-outside algorithm. Unfortunately, the application of SCFGs and the inside-outside algorithm to natural-language modeling [4, 5, 6] has so far been inconclusive.
Several reasons can be adduced for the difficulties. First, each iteration of the inside-outside algorithm on a grammar with $n$ nonterminals may require $O(n^3 |w|^3)$ time per training sentence $w$, while each iteration of its finite-state counterpart, training an HMM with $s$ states, requires at worst $O(s^2 |w|)$ time per training sentence. Second, the convergence properties of the algorithm deteriorate sharply as the number of nonterminal symbols increases. This can be understood intuitively by observing that the algorithm searches for the maximum of a function whose number of local maxima grows with the number of nonterminals. Finally, although SCFGs provide a hierarchical model of the language, that structure is undetermined by raw text, and only by chance will the inferred grammar agree with qualitative linguistic judgments of sentence structure. For example, since in English texts pronouns are very likely to immediately precede a verb, a grammar inferred from raw text will tend to group together the subject pronoun with the verb.
We describe here an extension of the inside-outside algorithm that infers the parameters of a stochastic context-free grammar from a partially parsed corpus, thus providing a tighter connection between the hierarchical structure of the inferred SCFG and that of the training corpus. The algorithm takes advantage of whatever constituent information is provided by the training corpus bracketing, ranging from a complete constituent analysis of the training sentences to the unparsed corpus used for the original inside-outside algorithm. In the latter case, the new algorithm reduces to the original one.
Using a partially parsed corpus has several important advantages. We show empirically that the use of a partially parsed corpus can decrease the number of iterations needed to reach a solution. We also exhibit cases where a good solution is found from a partially parsed corpus but not from raw text. Most importantly, the use of a partially parsed corpus enables the algorithm to infer grammars that derive constituent boundaries that cannot be inferred from raw text.
We first outline our extension of the inside-outside algorithm to partially parsed text, and then report preliminary experiments illustrating the advantages of the extended algorithm.
2. PARTIALLY BRACKETED TEXT
Informally, a partially bracketed corpus is a set of sentences annotated with parentheses marking constituent boundaries that any analysis of the corpus should respect. More precisely, we start from a corpus $C$ consisting of bracketed strings, which are pairs $c = (w, B)$ where $w$ is a string and $B$ is a bracketing of $w$. For convenience, we define the length of the bracketed string $c$ by $|c| = |w|$. Given a string $w = w_1 \cdots w_{|w|}$, a span of $w$ is a pair of integers $(i, j)$ with $0 \le i < j \le |w|$. By convention, span $(i, j)$ delimits the substring ${}_i w_j = w_{i+1} \cdots w_j$ of $w$. We also ...
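The span convention above can be illustrated with a minimal Python sketch. The function names `substring` and `all_spans`, the representation of a string $w$ as a list of words, and the representation of a bracketing $B$ as a set of spans are all assumptions made for illustration; the excerpt does not specify them. With 0-based Python indexing, the span $(i, j)$ that delimits $w_{i+1} \cdots w_j$ corresponds to the slice `w[i:j]`.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]


def substring(w: List[str], span: Span) -> List[str]:
    """Return the words delimited by span (i, j), i.e. w_{i+1} ... w_j."""
    i, j = span
    # A span must satisfy 0 <= i < j <= |w|.
    if not (0 <= i < j <= len(w)):
        raise ValueError(f"{span} is not a span of a string of length {len(w)}")
    # Positions i+1..j in the 1-based notation are the 0-based slice w[i:j].
    return w[i:j]


def all_spans(w: List[str]) -> List[Span]:
    """Enumerate every span (i, j) with 0 <= i < j <= |w|."""
    n = len(w)
    return [(i, j) for i in range(n) for j in range(i + 1, n + 1)]


# Hypothetical bracketed string c = (w, B); B is assumed to be a set of spans.
w = "she saw the old man".split()
B: Set[Span] = {(0, 5), (2, 5)}
c = (w, B)

# The length of a bracketed string is the length of its string: |c| = |w|.
length_of_c = len(c[0])
```

For a string of length $|w|$ there are $|w|(|w|+1)/2$ spans, which is one way to see why chart-style algorithms over all spans grow quickly with sentence length.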
Similar resources
Reestimation and Best-First Parsing Algorithm for Probabilistic Dependency Grammars
This paper presents a reestimation algorithm and a best-first parsing (BFP) algorithm for probabilistic dependency grammars (PDG). The proposed reestimation algorithm is a variation of the inside-outside algorithm adapted to probabilistic dependency grammars. The inside-outside algorithm is a probabilistic parameter reestimation algorithm for phrase structure grammars in Chomsky Normal Form (CN...
Towards Automatic Grammar Acquisition from a Bracketed Corpus
1 Introduction Designing and refining a natural language grammar is a difficult and time-consuming task and requires a large amount of skilled effort. A hand-crafted grammar is usually not completely satisfactory and frequently fails to cover many unseen sentences. Automatic acquisition of grammars is a solution to this problem. Recently, with the increasing availability of large, mac...
Lexical Heads, Phrase Structure and the Induction of Grammar
Acquiring linguistically plausible phrase-structure grammars from ordinary text has proven difficult for standard induction techniques, and researchers have turned to supervised training from bracketed corpora. We examine why previous approaches have failed to acquire desired grammars, concentrating our analysis on the inside-outside algorithm (Baker, 1979), and propose that with a representati...
Fast Statistical Grammar Induction
The statistical induction of context free grammars from bracketed corpora with the Inside Outside Algorithm has often inspired researchers, but the computational complexity has made it impossible to generate a large scale grammar. The method we suggest achieves the same results as earlier research, but at a much smaller expense in computer time. We explain the modifications needed to the algori...
Statistical Parsing with a Grammar Acquired from a Bracketed Corpus Based on Clustering Analysis
This work proposes a new method for learning a context-sensitive conditional probability context-free grammar from an unlabeled bracketed corpus based on clustering analysis, and introduces a natural language parsing model which uses a probability-based scoring function of the grammar to rank parses of a sentence. The method is superior to previous works (i.e., [Collins, 1996]) in the followi...
Publication date: 1992